Idea of disclosure

1. Describe your invention, stating the problem solved (if appropriate), and indicating the advantages of
using the invention.

The rate at which code is being developed for existing platforms is increasing each year. Investment in this
software can often exceed the cost of the actual system. In addition, when new hardware is created,
a time lag often exists between when the architecture is available and when software is migrated to the new
architecture. Often, to support this legacy code, the processors keep a portion of the silicon dedicated to
legacy logic that maintains backward compatibility with the prior machine instructions. The machine is
placed on the market with the potential of achieving greater performance once the new software is
obtained. As such, the performance gains of follow-on processor architectures are often not realized by
the end user.

This disclosure proposes a new hardware and software solution to more fully utilize the newer
architecture features of a microprocessor. The mechanism will work in real time while the program is being
run by the system. The essential idea is that a fully integrated structure is included in the system
architecture, including the processor, that will allow code translation of legacy code to enhanced code.
Unlike prior attempts, this mechanism is not performed in-stream with the instruction fetches. It uses
another approach that avoids the penalties that occur whenever real time code optimization is attempted.

At the system level, the optimization occurs as the program is being loaded into the main memory, the L3
cache, the L2 cache, and the L1 cache. While the program is running, the external optimizer is optimizing
the legacy code a page at a time. This optimization occurs even while the original non-optimized code
is running. When the code is optimized, the legacy page is replaced with the optimized page. From
that point forward, the user program sees the performance boost of the optimized page.

A mechanism is built in to maintain a copy of the optimized page for future reloads of the application.
This includes a mechanism to store the optimized code and processor state in a non-volatile storage
device when the system is shut down. When the system is restarted, the user can choose to continue
from this same state so it does not have to reoptimize the code. This allows a program to be optimized
once and simply re-executed whenever it is used in the future.

A description of the invention is provided below.

2. How does the invention solve the problem or achieve an advantage (a description of "the invention",
including figures inline as appropriate)?

i) Startup

Upon machine start up, the operating system starts loading the requested programs from the permanent
memory space, usually a hard disk area. These programs are the applications selected by the user.
Assuming that no prior optimized pages exist, the program code will be non-optimized. The request to the
target processor is handled in this order. The processor calculates a real address, and looks up
in its page table to see if the page has been generated before. If not, a new page address will be
generated, and the memory load from the hard disk will occur. For optimum performance, the load will be
satisfied as soon as possible from the hard disk space to external main memory, to the L3 cache (optional),
to the L2 cache, to the L1 cache, and to the dispatch unit. The processor will continue to fetch memory
locations to run the selected applications, and valid non-optimized pages will exist for the loaded
applications. In most cases, these will be incomplete page locations in the caches, but will be a complete
page in the main memory. At this stage of operation, the machine will function in the same manner as
legacy machines.
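
The startup lookup described above can be sketched as follows. This is a minimal illustration only: the
names fetch_page, page_table, main_memory and disk are hypothetical, the caches are omitted, and frame
allocation is assumed to be a simple append.

```python
def fetch_page(vpage, page_table, main_memory, disk):
    """Return the contents of a page, loading it from disk on first use.

    page_table:  dict mapping page number -> frame index (hypothetical layout)
    main_memory: list of page contents, indexed by frame
    disk:        dict mapping page number -> page contents
    """
    if vpage not in page_table:
        # Page has not been generated before: allocate a new frame
        # and satisfy the load from the hard disk image.
        main_memory.append(disk[vpage])
        page_table[vpage] = len(main_memory) - 1
    return main_memory[page_table[vpage]]
```

A second request for the same page is served from main memory without touching the disk, which is the
legacy behaviour the rest of the disclosure builds on.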



ii) Instruction Optimizer

The Optimizer consists of a "processor" and associated control logic whose task is to convert
non-optimized instructions into a stream of instructions that execute faster on the target processor.
Collectively, the "processor" and control logic are referred to as the Instruction Recoder.
The Instruction Recoder is truly separate from the rest of the system: it may operate at a separate clock
speed from the target processor and it is not affected by occurrences within the main system (e.g.,
interrupts, exceptions). It simply reads instructions from a page of main memory, reschedules them for
higher performance and stores them back to main memory. The Instruction Recoder can be
implemented in two ways: 1) a microcontroller that executes a recoding algorithm stored in ROM or 2)
hardwired logic.

Since both the Instruction Recoder and target processor have access to main memory, this invention
requires that main memory be multiported (two read ports and two write ports). This allows both
processes to operate at the same time without causing structural hazards. In other words, the Instruction
Recoder can read one non-optimized instruction and write one optimized instruction per cycle, and the
target processor can read/write one item from this same memory element. In another embodiment the
Instruction Recoder requests a priority interrupt to the memory subsystem whenever it needs to access
main memory, and it is granted access when the target processor is not using the memory bus.
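
The second embodiment (shared bus with the target processor taking priority) can be sketched per cycle
as below. The function name arbitrate and the boolean request lists are assumptions made for
illustration; the disclosure does not specify the arbitration logic.

```python
def arbitrate(cpu_requests, recoder_requests):
    """Decide, cycle by cycle, who is granted the memory bus.

    Both arguments are equal-length lists of booleans (one entry per cycle).
    The target processor always wins; the Instruction Recoder is granted
    the bus only on cycles the processor leaves idle.
    """
    grants = []
    for cpu_wants, recoder_wants in zip(cpu_requests, recoder_requests):
        if cpu_wants:
            grants.append("cpu")
        elif recoder_wants:
            grants.append("recoder")
        else:
            grants.append(None)
    return grants
```

Under this policy the Recoder never delays the target processor, at the cost of making slower progress
when the bus is busy.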

The Instruction Recoder must be able to detect self-modifying code and react accordingly to it. This can
be implemented by a direct communication from the target processor to the Instruction Recoder, or it can
be detected by the Instruction Recoder when the L2/L3 caches write back to main memory. In any case,
the detection of modified code causes the Instruction Recoder to either 1) check its optimized code to
ensure it is still valid and make changes if needed, or 2) disable optimization for the entire section of
self-modifying code. In essence, the Instruction Recoder must account for any code changes after the
non-optimized instructions are loaded into main memory.
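
One possible way to combine the two reactions is sketched below. The policy of re-checking the first few
writes and then disabling optimization for the page, the max_writes threshold, and the names PageState
and RecoderMonitor are all assumptions for illustration, not part of the disclosure.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    """Hypothetical per-page bookkeeping kept by the Instruction Recoder."""
    optimized: bool = False
    optimization_enabled: bool = True
    code_writes: int = 0


class RecoderMonitor:
    def __init__(self, max_writes=2):
        self.max_writes = max_writes  # assumed threshold
        self.pages = {}

    def on_code_write(self, page_id):
        """Called when a write to a code page is reported or observed."""
        state = self.pages.setdefault(page_id, PageState())
        state.code_writes += 1
        # The optimized copy can no longer be trusted as-is.
        state.optimized = False
        if state.code_writes > self.max_writes:
            # Option 2: give up on this section of self-modifying code.
            state.optimization_enabled = False
            return "disabled"
        # Option 1: re-validate (and if needed re-optimize) the page.
        return "recheck"
```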

The actual work of optimizing code is straightforward. The Instruction Recoder knows the architecture of 
the target processor, including the following things: 

• Number of execution units, types of units (e.g., 4 integer, 2 floating point, 2 multimedia)

• Latency of each operation (e.g., 1 cycle for add, 3 cycles for multiply)

• Number of architected registers, number of ports on register files and caches 

Therefore it knows how much parallelism is in the target processor, how fast operations execute and the
limits for getting data to/from the processor. Knowing this information, the Instruction Recoder analyzes
the non-optimized instruction stream to find a "more efficient" way to schedule the instructions. "More
efficient" means that the new code utilizes more resources in parallel (prevents execution units from
going idle) and it has fewer pipeline stalls. The Instruction Recoder uses common techniques used by
compilers (such as loop unrolling) to improve the scheduling of instructions. If the target processor has
a superior architecture to the one assumed for the non-optimized code, then the Instruction Recoder
should be able to improve the instruction scheduling such that the optimized code executes faster than
the non-optimized code. In addition, because the act of optimizing code is not in the fetch path for the
target processor, there is no penalty for creating the optimized code. In fact, this is transparent to the
user until the target processor starts executing the optimized code, and then they experience the
performance improvement.
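
As an illustration of rescheduling against known unit counts and latencies, the sketch below uses a
simple greedy list scheduler. This is a standard compiler technique swapped in for illustration; the
disclosure does not name a specific algorithm. The unit counts and the add/multiply latencies come from
the examples above, while the fadd entry, the tuple format, and the function name are assumptions. Only
true data dependences are modelled, and every dependence must name an instruction in the list.

```python
UNITS = {"int": 4, "fp": 2, "mm": 2}  # execution units per type (example from the text)
OPS = {
    "add": ("int", 1),   # (unit type, latency in cycles)
    "mul": ("int", 3),
    "fadd": ("fp", 2),   # assumed latency, for illustration only
}


def list_schedule(instrs, units=UNITS, ops=OPS):
    """Assign an issue cycle to each instruction.

    instrs: list of (name, op, deps) in original program order, where deps
            lists the names of instructions whose results are needed.
    Returns a dict mapping name -> issue cycle.
    """
    issue, done = {}, {}
    pending = list(instrs)
    cycle = 0
    while pending:
        free = dict(units)          # units available this cycle
        waiting = []
        for name, op, deps in pending:
            unit, latency = ops[op]
            ready = all(d in done and done[d] <= cycle for d in deps)
            if ready and free[unit] > 0:
                free[unit] -= 1
                issue[name] = cycle
                done[name] = cycle + latency
            else:
                waiting.append((name, op, deps))
        pending = waiting
        cycle += 1
    return issue
```

For the stream mul a; add b (needs a); add c; add d (needs c), the independent pair c, d is hoisted
ahead of b, so the multiply latency is overlapped with useful work instead of stalling the pipeline.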

iii) Mechanism to write optimized code to lower level caches (L1 and L2) 

Once the optimized page has fully been updated in the main memory (in other words, a duplicate
page has been created, but has been optimized for the new processor), the caches must be updated.
Since these are pages which are newly assigned, there is zero risk to running applications in preloading
these pages into the L3, L2, and the L1. The system logic has full access to the L3, and often the L2
caches, so the preloading of these caches could follow existing protocol. Preloading the L1 cache would
not be required, but may be desired for optimum performance. The mechanism to preload the L1 would be to
use standard DMA cycle steal functions, where the system memory management would monitor the state of the
processor function. Whenever bandwidth was available on the bus, the system memory management system
would DMAC the updated page entries into the L1 cache. Again, since these are new pages, the preload
would not risk corrupting the running applications.



iv) Mechanism to update page table in Main Memory and TLB in CPU

Once a sufficient portion of the new page has been updated, the old legacy page is ready to be swapped out
for the new, optimized page. Within the processor is a TLB (Page Table Buffer) that stores n number of
active pages. These pages are stored in the TLB to reduce the number of times that the page translation
address generation would occur. The larger the TLB, the more efficient the memory transactions will be.
This invention adds a DMAC path into the TLB. The memory system already knows the page entry that
was just optimized. A TLB DMAC cycle occurs in which the existing TLB page entry is invalidated. In the
simplest case this would be sufficient to pick up the new page, since the processor would now have to go to
the operating system and recalculate the page address, and in this case the assigned page address could be
the new optimized page entry. The preferred embodiment is to load the new optimized page entry into the
processor TLB, then validate the new page and invalidate the legacy page. Now, when the page lookup
occurs, and the index address is in a safe match, or adjusted (described below), the new optimized page
will be valid and the processor will continue at this point.

A circuit will exist on the processor to allow up to n addresses to be loaded for address translation/jump
to a new location within the optimized page. This buffer will be preloaded by the system memory control
with the DMAC protocol or via an I/O serial port [illegible] registers. The safe transition comparator will
compare the current legacy page index with the translated indexes [partly illegible]. If a match occurs,
then the page swap will be allowed as described, and the new translated index will be substituted for the
legacy index. In this manner the code will be functionally re-aligned and operation can continue from the
new optimized page. Since the new optimized page is preloaded, the user will not see a cache miss penalty,
but from this point on would have improved performance for this page of the application.
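
The safe-transition check and index substitution can be sketched as follows. The Tlb class, the
try_swap function, and the representation of the preloaded buffer as a dictionary from legacy index to
optimized index are hypothetical modelling choices, not details given in the disclosure.

```python
class Tlb:
    """Toy TLB: maps a virtual page number to a physical frame number."""

    def __init__(self):
        self.entries = {}


def try_swap(tlb, vpage, optimized_frame, current_index, safe_points):
    """Swap vpage over to the optimized frame if it is safe to do so now.

    safe_points maps a legacy page index to the equivalent index in the
    optimized page (the preloaded translation buffer of up to n entries).
    Returns the index to resume at, or None if the swap must wait.
    """
    if current_index not in safe_points:
        return None                       # no safe match: keep the legacy page
    tlb.entries[vpage] = optimized_frame  # validate new page, drop legacy mapping
    return safe_points[current_index]     # substitute the translated index
```

Until execution reaches one of the preloaded indexes, the legacy mapping stays in place and the program
continues to run the non-optimized page.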



v) Optional mechanism to invalidate non-optimized code in L1/L2 caches to free up space 

The legacy page will now be stale data in the L1, L2 and L3 caches. Two methods could exist to free
up this space. One method would be to allow the existing LRU to remove the entries. Since no further
requests will occur for these pages (since the new optimized pages are now the valid translations), the
data will become LRU naturally, and be removed. An alternative means could be implemented so that
once a legacy page has been swapped with the optimized page, the cache control logic could proactively
update the entries in the cache tags. It would look up the page index (stored in the tags), and when it
recognizes a replaced tag entry (by comparison with a stored register holding the last replaced page), the
LRU value would be updated for the line entry to mark the legacy page as LRU. In this manner, if a legacy
page had been recently used, and some other valid data still exists for that line, the legacy line entry
will be marked for replacement and not the still-valid entry. This optional portion of the invention could
improve L1, and even L2, cache hit rates.
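
The proactive alternative can be sketched on a single cache set as below. The list-based recency
ordering and the function name demote_legacy_lines are assumptions for illustration; real cache tags
would hold LRU bits rather than an ordered list.

```python
def demote_legacy_lines(cache_set, replaced_page):
    """Move every line belonging to the replaced page to the LRU end.

    cache_set: list of (page, line_address) tuples ordered from most
               recently used (index 0) to least recently used (last).
    Returns a new list; the last element is the next replacement victim.
    """
    keep = [line for line in cache_set if line[0] != replaced_page]
    stale = [line for line in cache_set if line[0] == replaced_page]
    return keep + stale
```

Even if the legacy line was the most recently used entry, it becomes the first candidate for replacement,
so still-valid lines from other pages survive longer.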



vi) Method to write CPU state and optimized code to virtual disk at shutdown 

Upon shut down, additional hardware resources can be allocated to preserve the current optimized
state of the user application codes. If the system is powered down without this restoration means, then
the next time the system is powered up, the user will start with the non-optimized legacy code, and the
machine learning/optimizing of the application will have to re-occur. An additional claim of this
invention is the structure to maintain the optimized state of the processor. The valid optimized page
entries contained in memory will be stored in a new optimized page address memory. The system, upon
shutdown, will read the Page Entry Memory (PEM) and write these locations of main memory into a Recovery
Optimized Page Disk (ROPD). In addition, the PEM contents will be saved onto the ROPD.

Upon restart, the system will first recover the contents of the ROPD into the main memory, but not into
the caches. Now, as the user runs previously optimized code, the code will still exist in main memory, and
the user will always see the optimized results, without the optimizer having to repeat the whole page
optimization routine.
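
The shutdown and restart steps can be sketched in software as below. Storing the ROPD image as a JSON
file, the dictionary layout of the PEM, and the function names save_ropd and restore_ropd are all
assumptions made for illustration; the disclosure describes a hardware structure and does not define an
on-disk format.

```python
import json


def save_ropd(path, pem, main_memory):
    """Write the PEM and the optimized pages it points to into an ROPD image.

    pem:         dict mapping page number -> frame number of the optimized page
    main_memory: dict mapping frame number -> page contents (bytes)
    """
    image = {
        "pem": {str(page): frame for page, frame in pem.items()},
        "pages": {str(frame): main_memory[frame].hex() for frame in pem.values()},
    }
    with open(path, "w") as fh:
        json.dump(image, fh)


def restore_ropd(path):
    """Read an ROPD image back; returns (pem, pages) for loading into main memory."""
    with open(path) as fh:
        image = json.load(fh)
    pem = {int(page): frame for page, frame in image["pem"].items()}
    pages = {int(frame): bytes.fromhex(data) for frame, data in image["pages"].items()}
    return pem, pages
```

Only frames referenced by the PEM are saved, so non-optimized pages are simply reloaded from their
original location on the next start.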



vii) Method to control the speed of the optimizer (# of instructions optimized per second)

How much effort is spent optimizing each instruction? Is the optimizer sufficiently ahead of the
processor? If it is, then the optimizer spends more time optimizing the code. Otherwise, the optimizer
reduces its effort. This is useful at startup when the processor and optimizer are working with the same
instructions.
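
A minimal sketch of such a throttle is shown below, assuming the optimizer's lead is measured in pages
and mapped to a small number of discrete effort levels. The function name choose_effort, the page-based
measure, and max_level are hypothetical; the disclosure does not say how the lead is measured.

```python
def choose_effort(optimizer_page, processor_page, max_level=3):
    """Pick an optimization effort level from the optimizer's lead.

    The lead is how many pages the optimizer is ahead of the processor.
    Level 0 means the cheapest pass; higher levels spend more time per
    instruction. The result is clamped to the range 0..max_level.
    """
    lead = optimizer_page - processor_page
    return max(0, min(max_level, lead))
```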

One option the user may have when loading applications is to be asked whether or not to allow the
optimizer to optimize the page before allowing any execution of the page. While this is not the core of
the invention, this could be allowed to occur for applications that are not timing critical.



Due to the unique nature of this invention, code can begin executing immediately upon the processor
without the enhanced features of the processor, and then in real time, the code is shifted over to the
enhanced architecture mode without the need of disrupting the service of the machine. The user can have
the choice of saving the upgrade, so that future system use will not have the initial performance penalty.